
    High-Dimensional Software Engineering Data and Feature Selection

    Software metrics collected during project development play a critical role in software quality assurance. A software practitioner is keen to learn which software metrics to focus on for software quality prediction. While a concise set of software metrics is often desired, a typical project collects a very large number of metrics. Minimal attention has been devoted to finding the minimum set of software metrics that has the same predictive capability as a larger set of metrics; we strive to answer that question in this paper. We present a comprehensive comparison between seven commonly used filter-based feature ranking techniques (FRT) and our proposed hybrid feature selection (HFS) technique. Our case study consists of a very high-dimensional (42 software attributes) software measurement data set obtained from a large telecommunications system. The empirical analysis indicates that HFS performs better than FRT; however, the Kolmogorov-Smirnov feature ranking technique demonstrates competitive performance. For the telecommunications system, it is found that only 10% of the software attributes are sufficient for effective software quality prediction.
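
    As a concrete illustration of the filter-ranking idea above, the following minimal Python sketch ranks features by the two-sample Kolmogorov-Smirnov statistic between fault-prone and not-fault-prone modules and keeps roughly 10% of them. The data, labels, and cutoff are invented placeholders, not the paper's setup, and the paper's HFS technique is not reproduced here.

        # Hypothetical sketch: KS filter ranking of software metrics.
        import numpy as np
        from scipy.stats import ks_2samp

        def ks_rank(X, y):
            """Rank features by the KS statistic between the two class distributions."""
            scores = [ks_2samp(X[y == 1, j], X[y == 0, j]).statistic
                      for j in range(X.shape[1])]
            return np.argsort(scores)[::-1]  # best-first feature indices

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 42))    # placeholder metric matrix (42 attributes)
        y = rng.integers(0, 2, size=200)  # placeholder fault-proneness labels
        top = ks_rank(X, y)[: max(1, X.shape[1] // 10)]  # keep ~10% of metrics
        print("selected metric indices:", top)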

    An Empirical Investigation of Filter Attribute Selection Techniques for Software Quality Classification

    Attribute selection is an important activity in data preprocessing for software quality modeling and other data mining problems. Software quality models have been used to improve the fault detection process. Finding faulty components in a software system during the early stages of the software development process can lead to a more reliable final product and can reduce development and maintenance costs. Some studies have shown that the prediction accuracy of the models improves when irrelevant and redundant features are removed from the original data set. In this study, we investigated four filter attribute selection techniques, Automatic Hybrid Search (AHS), Rough Sets (RS), Kolmogorov-Smirnov (KS), and Probabilistic Search (PS), and conducted experiments on a very large telecommunications software system. To evaluate classification performance on the smaller subsets of attributes selected using the different approaches, we built several classification models using five different classifiers. The empirical results demonstrated that by applying an attribute selection approach we can build classification models with an accuracy comparable to that of models built with the complete set of attributes, while the smaller subset contains less than 15 percent of the complete set of attributes. Therefore, the metrics collection, model calibration, model validation, and model evaluation times of future software development efforts of similar systems can be significantly reduced. In addition, we demonstrated that our recently proposed attribute selection technique, KS, outperformed the other three attribute selection techniques.
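
    A minimal sketch of the evaluation pattern described above, assuming scikit-learn: a classifier's cross-validated accuracy on the full attribute set is compared with its accuracy on a small filter-selected subset. SelectKBest with f_classif stands in for the paper's AHS/RS/KS/PS techniques, and the synthetic data is invented.

        # Compare accuracy on the full attribute set vs. a small subset.
        import numpy as np
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import GaussianNB
        from sklearn.pipeline import make_pipeline

        rng = np.random.default_rng(1)
        X = rng.normal(size=(300, 40))
        y = ((X[:, :3].sum(axis=1) + rng.normal(size=300)) > 0).astype(int)

        full = cross_val_score(GaussianNB(), X, y, cv=5).mean()
        small = cross_val_score(
            make_pipeline(SelectKBest(f_classif, k=6), GaussianNB()), X, y, cv=5
        ).mean()
        print(f"full 40 attrs: {full:.3f} | 6-attr subset: {small:.3f}")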

    Mining Data from Multiple Software Development Projects

    A large system often goes through multiple software project development cycles, in part due to changes in operation and development environments. For example, rapid turnover of the development team between releases can influence software quality, making it important to mine software project data over multiple system releases when building defect predictors. Data collection of software attributes is often conducted independently of the quality improvement goals, leading to the availability of a large number of attributes for analysis. The problems associated with variations in development process, data collection, and quality goals from one release to another emphasize the importance of selecting a best set of software attributes for software quality prediction. Moreover, it is intuitive to remove attributes that do not add to, or have an adverse effect on, the knowledge of the consequent model. Based on data from real-world software projects, we present a large case study that compares wrapper-based feature ranking techniques (WRT) and our proposed hybrid feature selection (HFS) technique. The comparison is done using both three-fold cross-validation (CV) and three-fold cross-validation with risk impact (CVR). It is shown that HFS is better than WRT, while CV is superior to CVR.
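
    The wrapper approach contrasted above can be sketched as a greedy forward search scored by the learner itself under three-fold cross-validation. This is only an illustration on assumed data: scikit-learn's SequentialFeatureSelector stands in for the paper's WRT, and the risk-impact variant (CVR) is not modeled.

        # Wrapper-style selection with three-fold cross-validation.
        import numpy as np
        from sklearn.feature_selection import SequentialFeatureSelector
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(2)
        X = rng.normal(size=(240, 20))
        y = ((X[:, 0] - X[:, 5] + rng.normal(size=240)) > 0).astype(int)

        clf = LogisticRegression(max_iter=1000)
        sfs = SequentialFeatureSelector(clf, n_features_to_select=4, cv=3).fit(X, y)
        picked = np.flatnonzero(sfs.get_support())
        acc = cross_val_score(clf, X[:, picked], y, cv=3).mean()
        print("wrapper-selected attrs:", picked, f"| 3-fold accuracy {acc:.3f}")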

    Robust estimation of bacterial cell count from optical density

    Optical density (OD) is widely used to estimate the density of cells in liquid culture, but cannot be compared between instruments without a standardized calibration protocol and is challenging to relate to actual cell count. We address this with an interlaboratory study comparing three simple, low-cost, and highly accessible OD calibration protocols across 244 laboratories, applied to eight strains of constitutive GFP-expressing E. coli. Based on our results, we recommend calibrating OD to estimated cell count using serial dilution of silica microspheres. This approach produces highly precise calibration (95.5% of residuals <1.2-fold), is easily assessed for quality control, also assesses the instrument's effective linear range, and can be combined with fluorescence calibration to obtain units of Molecules of Equivalent Fluorescein (MEFL) per cell, allowing direct comparison and data fusion with flow cytometry measurements; in our study, fluorescence-per-cell measurements showed only a 1.07-fold mean difference between plate reader and flow cytometry data.
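
    The calibration recommended above reduces, in its simplest form, to fitting OD against known particle counts from a serial dilution and inverting the fit. The sketch below uses invented numbers (slope, blank offset, dilution series), not the study's data, and omits the linear-range check the protocol includes.

        # Fit OD vs. known microsphere counts, then convert sample OD to count.
        import numpy as np

        counts = 2.0 ** -np.arange(8) * 3e8   # particles/well, 2x serial dilution
        od = 1.1e-9 * counts + 0.04           # synthetic reads with blank offset

        slope, intercept = np.polyfit(counts, od, 1)  # within the linear range
        def od_to_count(sample_od):
            return (sample_od - intercept) / slope

        print(f"OD 0.25 ~ {od_to_count(0.25):.3e} particles (cell-count proxy)")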

    A Comparative Study of Filter-based Feature Ranking Techniques

    One factor that affects the success of machine learning is the presence of irrelevant or redundant information in the training data set. Filter-based feature ranking techniques (rankers) rank the features according to their relevance to the target attribute; the most relevant features are then chosen to build classification models. A common way to evaluate the effectiveness of different feature ranking techniques is to assess the classification performance of models built with the respective selected feature subsets in terms of a given performance metric (e.g., classification accuracy or misclassification rate). Since a given performance metric usually captures only one specific aspect of classification performance, it may be unable to evaluate the classification performance from different perspectives. Also, there is no general consensus among researchers and practitioners regarding which performance metrics should be used to evaluate classification performance. In this study, we investigated six filter-based feature ranking techniques and built classification models using five different classifiers. The models were evaluated using eight different performance metrics. All experiments were conducted on four imbalanced data sets from a telecommunications software system. The experimental results demonstrate that the choice of performance metric may significantly influence the conclusions of a classification evaluation: one ranker may outperform another under a given performance metric, while for a different performance metric the results may be reversed. In this study, we found five distinct patterns when utilizing eight performance metrics to order six feature selection techniques.
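
    To make the metric-dependence point concrete, here is a small sketch, assuming scikit-learn and synthetic imbalanced data, that scores one model under several metrics; ordering two rankers by such scores can flip depending on which metric is read.

        # Score one classifier under several performance metrics.
        import numpy as np
        from sklearn.metrics import (accuracy_score, f1_score,
                                     recall_score, roc_auc_score)
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(3)
        X = rng.normal(size=(500, 10))
        y = (X[:, 0] + 0.5 * rng.normal(size=500) > 1.0).astype(int)  # imbalanced

        Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
        clf = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
        pred, prob = clf.predict(Xte), clf.predict_proba(Xte)[:, 1]

        for name, score in [("accuracy", accuracy_score(yte, pred)),
                            ("F-measure", f1_score(yte, pred)),
                            ("recall", recall_score(yte, pred)),
                            ("AUC", roc_auc_score(yte, prob))]:
            print(f"{name:9s}: {score:.3f}")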

    Statistical Inference for the Inverted Scale Family under General Progressive Type-II Censoring

    Two estimation problems are studied based on general progressively censored samples, with distributions from the inverted scale family (ISF) considered as prospective life distributions. The first is exact interval estimation for the unknown parameter θ, achieved by constructing a pivotal quantity. Through Monte Carlo simulations, the average 90% and 95% confidence intervals are obtained, and the validity of this interval estimation is illustrated with a numerical example. The second is the estimation of R = P(Y < X) in the case of the ISF. The maximum likelihood estimator (MLE) and the approximate maximum likelihood estimator (AMLE) are obtained, together with the corresponding symmetric asymptotic confidence intervals for R. Using bootstrap methods, we also propose two asymmetric confidence intervals for R, which perform well for small samples. Furthermore, assuming the scale parameters follow independent gamma priors, the Bayesian estimator and the highest posterior density (HPD) credible interval of R are derived. Finally, we evaluate the effectiveness of the proposed estimators through Monte Carlo simulations and provide an illustrative example with two real datasets.
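
    As a rough illustration of the R = P(Y < X) problem, the sketch below draws inverted-exponential samples (one simple ISF member), forms the plug-in estimate of R, and attaches a percentile-bootstrap interval. The parameters are invented, and the paper's MLE/AMLE, pivotal-quantity, and Bayesian constructions are not reproduced.

        # Plug-in estimate of R = P(Y < X) with a percentile bootstrap CI.
        import numpy as np

        rng = np.random.default_rng(4)
        theta_x, theta_y, n = 2.0, 1.0, 50
        x = theta_x / rng.exponential(size=n)  # inverted-exponential draws
        y = theta_y / rng.exponential(size=n)

        r_hat = (y[:, None] < x[None, :]).mean()  # estimate of P(Y < X)

        boot = [(rng.choice(y, n)[:, None] < rng.choice(x, n)[None, :]).mean()
                for _ in range(2000)]
        lo, hi = np.percentile(boot, [2.5, 97.5])
        print(f"R_hat = {r_hat:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")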

    Multi-Objective Optimization by CBR GA-Optimizer for Module-Order Modeling

    When the resources allocated for software quality improvement are limited or unknown, an estimate of the relative rank-order of modules based on a quality factor, such as the number of faults, is of practical importance to the software quality assurance team, because improvement efforts can then be targeted at the most faulty modules according to resource availability. A module-order model (MOM) can be used to determine the relative rank-order of modules. A MOM usually ranks the modules according to the predicted number of faults obtained from an underlying quantitative prediction technique, such as multiple linear regression or case-based reasoning. In this paper, we propose a computational intelligence-based method for optimizing the performance of a MOM. The method maximizes the number of faults accounted for by the given percentage of modules enhanced. A new modeling tool called CBR GA-optimizer is developed through a synergy of genetic algorithms (GA) and case-based reasoning (CBR). The tool automatically finds the best CBR fault prediction models according to a project-specific objective function.
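
    The GA half of the tool can be sketched as evolving a parameter vector whose fitness is the fraction of total faults captured by the top-ranked modules. The sketch below evolves weights of a linear fault predictor on invented data; the actual CBR GA-optimizer tunes case-based reasoning models, which is not reproduced here.

        # Toy GA: maximize faults captured by the top 20% of ranked modules.
        import numpy as np

        rng = np.random.default_rng(5)
        X = rng.normal(size=(120, 6))  # module metrics
        faults = np.maximum(0.0, X @ np.array([3, 1, 0, 0, 2, 0])
                                 + rng.normal(size=120))

        def fitness(w, frac=0.2):
            order = np.argsort(X @ w)[::-1]  # predicted rank-order of modules
            k = int(len(order) * frac)
            return faults[order[:k]].sum() / faults.sum()

        pop = rng.normal(size=(30, 6))
        for _ in range(40):                          # simple elitist GA
            scores = np.array([fitness(w) for w in pop])
            parents = pop[np.argsort(scores)[-10:]]  # truncation selection
            kids = (parents[rng.integers(0, 10, 20)]
                    + 0.3 * rng.normal(size=(20, 6)))  # mutation
            pop = np.vstack([parents, kids])
        print(f"faults captured in top 20%: {fitness(max(pop, key=fitness)):.2%}")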

    An Application of a Rule-Based Model in Software Quality Classification

    A new rule-based classification model (RBCM) …

    A Brain-Inspired Decision-Making Linear Neural Network and Its Application in Automatic Drive

    Brain-like intelligent decision-making is a prevailing trend in today's world, and, inspired by bionics and computer science, the linear neural network has become one of the main means of realizing human-like decision-making and control. This paper proposes a method for classifying drivers' driving behaviors based on a fuzzy algorithm and establishes a brain-inspired decision-making linear neural network. First, experimental data samples from different drivers were obtained through a driving simulator. Then, an objective fuzzy classification algorithm was designed to distinguish different driving behaviors from the experimental data. In addition, a brain-inspired linear neural network was established to realize human-like decision-making and control. Finally, the accuracy of the proposed method was verified through training and testing. This study extracts drivers' driving characteristics through driving simulator tests, providing a driving behavior reference for the human-like decision-making of an intelligent vehicle.
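
    One simple way to realize the objective fuzzy classification step described above is triangular membership functions over an aggregate driving feature, labeling each driver by the strongest membership. The thresholds and the single-feature setup below are invented for illustration and are not the paper's algorithm.

        # Fuzzy driving-style classifier with triangular memberships.
        def tri(x, a, b, c):
            """Triangular membership peaking at b on [a, c]."""
            return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

        STYLES = {  # parameters over mean |acceleration| in m/s^2 (invented)
            "cautious":   (0.0, 0.5, 1.5),
            "normal":     (0.5, 1.5, 2.5),
            "aggressive": (1.5, 3.0, 5.0),
        }

        def classify(mean_abs_accel):
            grades = {s: tri(mean_abs_accel, *p) for s, p in STYLES.items()}
            return max(grades, key=grades.get), grades

        label, grades = classify(1.8)
        print(label, grades)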